Power and multicollinearity in small networks: A discussion of

“Tale of Two Datasets: Representativeness and Generalisability of Inference for Samples of Networks”

JSM 2023
Toronto, Canada

George G. Vega Yon, Ph.D.

The University of Utah

2023-08-09

Overview

Highlights Krivitsky, Coletti, and Hens (2022)

What I highlight in their paper:

  • Start-to-finish framework for multi-ERG models.

  • Dealing with heterogeneous samples.

  • Model building process.

  • Goodness-of-fit analyses.

Two important missing pieces (for the next paper): power analysis and how to deal with collinearity in small networks.

Power analysis in ERGMs

Sample size in ERGMs

Two different questions: “How many nodes?” and “How many networks?”

Number of nodes

Number of networks

  • There is a growing number of studies featuring multiple networks (e.g., egocentric studies).

  • There’s no clear way to do power analysis in ERGMs.

  • In funding justification, power analysis is fundamental, so we need that.

A possible approach

We can leverage conditional ERG models for power analysis.

  • Conditioning on one sufficient statistic results in a distribution invariant to the associated parameter, formally:

    \[\begin{align} \notag
    {\mbox{Pr}_{\mathcal{Y},\boldsymbol{\theta}}\left(\boldsymbol{Y}= \boldsymbol{y}\left|\;\boldsymbol{g}\left(\boldsymbol{y}\right)_l = s_l\right.\right)}
    & = \frac{
    {\mbox{Pr}_{\mathcal{Y},\boldsymbol{\theta}}\left(\boldsymbol{g}\left(\boldsymbol{Y}\right)_{-l} = \boldsymbol{g}\left(\boldsymbol{y}\right)_{-l}, \boldsymbol{g}\left(\boldsymbol{y}\right)_l = s_l\right) }
    }{
    \sum_{\boldsymbol{y}'\in\mathcal{Y}:\boldsymbol{g}\left(\boldsymbol{y}'\right)_l = s_l}{\mbox{Pr}_{\mathcal{Y},\boldsymbol{\theta}}\left(\boldsymbol{Y} = \boldsymbol{y}'\right) }
    } \\
    & = \frac{
    \mbox{exp}\left\{{\boldsymbol{\theta}_{-l}}^{\boldsymbol{t}}\boldsymbol{g}\left(\boldsymbol{y}\right)_{-l}\right\}
    }{
    \kappa_{\mathcal{Y}}\left(\boldsymbol{\theta}\right)_{-l}
    }, \tag{1}
    \end{align}\]

    where \(\boldsymbol{g}\left(\boldsymbol{y}\right)_l\) and \(\boldsymbol{\theta}_l\) are the \(l\)-th elements of \(\boldsymbol{g}\left(\boldsymbol{y}\right)\) and \(\boldsymbol{\theta}\), respectively, \(\boldsymbol{g}\left(\boldsymbol{y}\right)_{-l}\) and \(\boldsymbol{\theta}_{-l}\) are their complements, and \(\kappa_{\mathcal{Y}}\left(\boldsymbol{\theta}\right)_{-l} = \sum_{\boldsymbol{y}' \in \mathcal{Y}: \boldsymbol{g}\left(\boldsymbol{y}'\right)_l = s_l}\mbox{exp}\left\{{\boldsymbol{\theta}_{-l}}^{\boldsymbol{t}}\boldsymbol{g}\left(\boldsymbol{y}'\right)_{-l}\right\}\) is the normalizing constant.

  • We can use this to generate networks with a prescribed density (based on previous studies) and compute power through simulation.
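The invariance in Eq. (1) can be checked by brute force on a tiny case. The sketch below is illustrative only: it assumes undirected networks with 4 nodes, a model with an edge count and a homophily count, and a made-up two-group attribute (`GROUP`). It enumerates every graph, conditions on the number of ties, and shows that the conditional distribution does not change with the edge parameter.

```python
import itertools
import math

NODES = range(4)
DYADS = list(itertools.combinations(NODES, 2))  # the 6 possible undirected ties
GROUP = [0, 0, 1, 1]                            # hypothetical node attribute

# Every undirected graph on 4 nodes, coded as a 0/1 vector over DYADS (2^6 = 64)
GRAPHS = list(itertools.product((0, 1), repeat=len(DYADS)))


def stats(y):
    """Sufficient statistics g(y): (number of ties, number of same-group ties)."""
    edges = sum(y)
    homophily = sum(t for t, (i, j) in zip(y, DYADS) if GROUP[i] == GROUP[j])
    return edges, homophily


def conditional_pmf(theta_edges, theta_homophily, s_edges):
    """Pr(Y = y | edges(y) = s_edges) by exact enumeration."""
    weights = {}
    for y in GRAPHS:
        edges, homophily = stats(y)
        if edges == s_edges:
            weights[y] = math.exp(theta_edges * edges + theta_homophily * homophily)
    kappa = sum(weights.values())  # normalizing constant over the restricted support
    return {y: w / kappa for y, w in weights.items()}


# Two very different edge parameters, same homophily parameter, 3 ties
pmf_a = conditional_pmf(-1.0, 0.5, s_edges=3)
pmf_b = conditional_pmf(3.0, 0.5, s_edges=3)
max_gap = max(abs(pmf_a[y] - pmf_b[y]) for y in pmf_a)
print(len(pmf_a), max_gap)  # number of graphs in the support, largest difference
```

The edge term contributes the same factor to every graph in the restricted support, so it cancels between numerator and normalizing constant; only the homophily parameter is left.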

Example: Detecting gender homophily

  • Study gender homophily in networks of size 8.

  • On average, the focal networks have 20 ties (i.e., a density of \((2\times 20)/(8 \times 7) \approx 0.71\)).

  • To detect an effect size of \(\boldsymbol{\theta}_{\mbox{homophily}} = 2\), we could approximate the required sample size as follows:

  1. For each \(n \in N \equiv \{10, 20, \dots\}\), do:

    1. With Eq. (1), use MCMC to simulate \(1,000\) sets of \(n\) networks, each with 8 nodes and 20 ties.

    2. For each set, fit a conditional ERGM to estimate \(\widehat{\boldsymbol{\theta}}_{\mbox{homophily}}\), and generate the indicator variable \(p_{n, i}\), equal to one if the estimate is statistically significant at the 5% level (95% confidence).

    3. The empirical power for \(n\) is equal to \(p_n \equiv \frac{1}{1,000}\sum_{i}p_{n, i}\).

  2. Once we have computed the sequence \(\{p_{10}, p_{20}, \dots\}\), we can fit a linear model to estimate the sample size as a function of the power, e.g., \(n = \beta_0 + \beta_1 p_n + \beta_2 p_n^2 + \varepsilon\).

  3. With the previous model in hand, we can estimate the sample size required to detect a given effect size with a given power.
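Step 1 of the procedure above can be sketched in a few lines under strong simplifying assumptions, none of which come from the talk: undirected networks, a model with only edges and gender homophily, and 4 women and 4 men in every network. In that special case, once the tie count is fixed at 20, the number of same-gender ties follows a noncentral hypergeometric distribution, so the sketch samples it directly instead of running MCMC and uses a likelihood-ratio test in place of a Wald test. All function names are illustrative.

```python
import math
import random

N_NODES, N_TIES = 8, 20  # from the example above
N_FEMALE = 4             # assumption: 4 women and 4 men per network

n_dyads = math.comb(N_NODES, 2)                                      # 28
n_same = math.comb(N_FEMALE, 2) + math.comb(N_NODES - N_FEMALE, 2)   # 12
n_diff = n_dyads - n_same                                            # 16

# Possible counts h of same-gender ties, and how many graphs realize each
SUPPORT = list(range(max(0, N_TIES - n_diff), min(n_same, N_TIES) + 1))
COUNTS = [math.comb(n_same, h) * math.comb(n_diff, N_TIES - h) for h in SUPPORT]


def weights(theta):
    """Unnormalized conditional probabilities of each h given 20 ties."""
    return [c * math.exp(theta * h) for c, h in zip(COUNTS, SUPPORT)]


def log_kappa(theta):
    """Log normalizing constant of the conditional model."""
    return math.log(sum(weights(theta)))


def mean_h(theta):
    w = weights(theta)
    return sum(h * wi for h, wi in zip(SUPPORT, w)) / sum(w)


def fit_theta(h_bar, lo=-20.0, hi=20.0):
    """Conditional MLE: solve E_theta[h] = h_bar by bisection (clipped to [lo, hi])."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if mean_h(mid) < h_bar:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def is_significant(hs):
    """Likelihood-ratio test of theta = 0 at the 5% level for one set of networks."""
    n, s = len(hs), sum(hs)
    theta_hat = fit_theta(s / n)
    loglik = lambda t: t * s - n * log_kappa(t)
    return 2 * (loglik(theta_hat) - loglik(0.0)) > 3.841  # chi-square(1) cutoff


def empirical_power(theta, n, reps=1000, rng=None):
    """Share of simulated sets of n networks in which the effect is detected."""
    rng = rng or random.Random(1)
    w = weights(theta)
    hits = sum(is_significant(rng.choices(SUPPORT, weights=w, k=n)) for _ in range(reps))
    return hits / reps


for n in (10, 20, 30):
    print(n, empirical_power(2.0, n))
```

With an effect as large as 2 and networks this dense, power is likely to be close to one already at \(n = 10\) in this simplified setting, so a grid of smaller effect sizes or smaller \(n\) would give a more informative curve for steps 2 and 3. Richer models (more terms, directed ties) lose the closed form and would need the MCMC approach described in the list.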

Collinearity in ERGMs

Not like in regular models

  • Variance Inflation Factor [VIF] is a common measure of collinearity in regular models.

  • Usually, VIF > 10 is considered problematic.

  • Duxbury’s (2021) large simulation study recommends using a VIF between 150 and 200 as the threshold for multicollinearity.

  • In small networks, this could be more severe.
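To put these thresholds on a common scale, recall the textbook regression definition of the VIF for the \(j\)-th term, where \(R_j^2\) is the squared multiple correlation between that term and the remaining ones. Assuming the ERGM diagnostic keeps this form (an assumption made here for illustration, not a statement taken from Duxbury 2021), the cutoffs above translate to:

\[\mbox{VIF}_j = \frac{1}{1 - R_j^2}, \qquad \mbox{VIF}_j = 10 \iff R_j^2 = 0.9, \qquad \mbox{VIF}_j = 150 \iff R_j^2 \approx 0.9933, \qquad \mbox{VIF}_j = 200 \iff R_j^2 = 0.995.\]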

Predicting statistics

In a directed network with 5 nodes, two of them female and three male, transitive triads are almost perfectly predicted by mutual ties.

When \(\boldsymbol{\theta}_{\mbox{ttriad}} = 1\) (second row), Cor(transitive triads, mutual ties) \(\to 1\), and VIF is > 4,500.

Collinearity in small networks

  • In the same network (5 nodes), many combinations of model parameters yield high correlations and VIFs.

  • KCH’s networks were highly dense (0.93 and 0.73 for the household and egocentric samples, respectively) \(\rightarrow\) collinearity should be severe.

Discussion

  • Krivitsky, Coletti, and Hens’ work makes an important contribution to ERG models, most notably in model building, selection, and GOF for multi-network models.

  • Power (sample size requirements) and multicollinearity are two important issues that are yet to be addressed.

  • I presented a possible approach to deal with power analysis in ERGMs using conditional distributions.

  • Collinearity in small networks (like those in KCH) can be serious, more so than in larger networks, yet we need to explore this further.

Thanks!

george.vegayon at utah.edu

https://ggv.cl

@gvegayon@qoto.org

References

Duxbury, Scott W. 2021. “Diagnosing Multicollinearity in Exponential Random Graph Models.” Sociological Methods & Research 50 (2): 491–530. https://doi.org/10.1177/0049124118782543.
Krivitsky, Pavel N., Pietro Coletti, and Niel Hens. 2022. “A Tale of Two Datasets: Representativeness and Generalisability of Inference for Samples of Networks.”
Schweinberger, Michael, Pavel N. Krivitsky, and Carter T. Butts. 2017. “A Note on the Role of Projectivity in Likelihood-Based Inference for Random Graph Models,” July, 1–6.
Schweinberger, Michael, Pavel N. Krivitsky, Carter T. Butts, and Jonathan R. Stewart. 2020. “Exponential-Family Models of Random Graphs: Inference in Finite, Super and Infinite Population Scenarios.” Statistical Science 35 (4): 627–62. https://doi.org/10.1214/19-sts743.